14 research outputs found
Tropical Geometry of Phylogenetic Tree Space: A Statistical Perspective
Phylogenetic trees are the fundamental mathematical representation of
evolutionary processes in biology. As data objects, they are characterized by
the challenges associated with "big data," as well as the complication that
their discrete geometric structure results in a non-Euclidean phylogenetic tree
space, which poses computational and statistical limitations. We propose and
study a novel framework to study sets of phylogenetic trees based on tropical
geometry. In particular, we focus on characterizing our framework for
statistical analyses of evolutionary biological processes represented by
phylogenetic trees. Our setting exhibits analytic, geometric, and topological
properties that are desirable for theoretical studies in probability and
statistics, as well as increased computational efficiency over the current
state-of-the-art. We demonstrate our approach on seasonal influenza data.
Comment: 28 pages, 5 figures, 1 table
A Quasi-Likelihood Approach to Zero-Inflated Spatial Count Data
The increased accessibility of data that are geographically referenced and correlated increases the demand for techniques of spatial data analysis. The subset of such data consisting of discrete counts exhibits particular difficulties, and the challenges increase further when a large proportion (typically 50% or more) of the counts are zero-valued. Such scenarios arise in many applications in numerous fields of research, and it is often desirable to draw inferences about subtleties of the process, despite the lack of substantive information obscuring the underlying stochastic mechanism generating the data. An ecological example provides the impetus for the research in this thesis: when observations for a species are recorded over a spatial region, and many of the counts are zero-valued, are the abundant zeros due to bad luck, or are aspects of the region making it unsuitable for the survival of the species? In the framework of generalized linear models, we first develop a zero-inflated Poisson generalized linear regression model, which explains the variability of the responses given a set of measured covariates, and additionally allows for the distinction of two kinds of zeros: sampling ("bad luck" zeros), and structural (zeros that provide insight into the data-generating process). We then adapt this model to the spatial setting by incorporating dependence within the model via a general, leniently-defined quasi-likelihood strategy, which provides consistent, efficient and asymptotically normal estimators, even under erroneous assumptions about the covariance structure. In addition to this advantage of robustness to dependence misspecification, our quasi-likelihood model overcomes the need for the complete specification of a probability model, thus rendering it very general and relevant to many settings. To complement the developed regression model, we further propose methods for the simulation of zero-inflated spatial stochastic processes.
This is done by deconstructing the entire process into a mixed, marked spatial point process: we augment existing algorithms for the simulation of spatial marked point processes to comprise a stochastic mechanism to generate zero-abundant marks (counts) at each location. We propose several such mechanisms, and consider interaction and dependence processes for random locations as well as over a lattice.
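The distinction between structural and sampling zeros described in this abstract can be illustrated with the standard (non-spatial) zero-inflated Poisson mass function. The sketch below is not the thesis's model: the function name `zip_pmf` and the parameters `lam` (Poisson mean) and `pi` (probability of a structural zero) are illustrative assumptions, and no covariates or spatial dependence are included.

```python
import math

def zip_pmf(k, lam, pi):
    """P(Y = k) for a zero-inflated Poisson (illustrative, non-spatial).

    With probability pi the count is a structural zero; otherwise it is
    drawn from Poisson(lam), which can also yield a sampling zero.
    """
    poisson = math.exp(-lam) * lam**k / math.factorial(k)
    structural = pi if k == 0 else 0.0
    return structural + (1.0 - pi) * poisson

# Probability of observing a zero, split into its two sources.
p_structural = 0.3                                 # assumed value of pi
p_sampling = (1.0 - 0.3) * math.exp(-2.0)          # Poisson zero, lam = 2.0
p_zero = zip_pmf(0, lam=2.0, pi=0.3)               # equals the sum of the two
```

Here the mass at zero is the sum of the structural part and the "bad luck" part, which is what lets such a model separate the two kinds of zeros.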
k-Means Clustering for Persistent Homology
Persistent homology is a methodology central to topological data analysis
that extracts and summarizes the topological features within a dataset as a
persistence diagram; it has recently gained much popularity from its myriad
successful applications to many domains. However, its algebraic construction
induces a metric space of persistence diagrams with a highly complex geometry.
In this paper, we prove convergence of the k-means clustering algorithm on
persistence diagram space and establish theoretical properties of the solution
to the optimization problem in the Karush--Kuhn--Tucker framework.
Additionally, we perform numerical experiments on various representations of
persistent homology, including embeddings of persistence diagrams as well as
diagrams themselves and their generalizations as persistence measures; we find
that clustering directly on persistence diagrams and measures outperforms
clustering on their vectorized representations.
Comment: 20 pages, 6 figures
Fast Topological Signal Identification and Persistent Cohomological Cycle Matching
Within the context of topological data analysis, the problems of identifying
topological significance and matching signals across datasets are important and
useful inferential tasks in many applications. The limitation of existing
solutions to these problems, however, is computational speed. In this paper, we
harness the state-of-the-art for persistent homology computation by studying
the problem of determining topological prevalence and cycle matching using a
cohomological approach, which increases their feasibility and applicability to
a wider variety of applications and contexts. We demonstrate this on a wide
range of real-life, large-scale, and complex datasets. We extend existing
notions of topological prevalence and cycle matching to include general
non-Morse filtrations. This provides the most general and flexible
state-of-the-art adaptation of topological signal identification and persistent
cycle matching, which performs comparisons of orders of ten for thousands of
sampled points in a matter of minutes on standard institutional HPC CPU
facilities.
Probability Metrics for Tropical Spaces of Different Dimensions
The problem of comparing probability distributions is at the heart of many
tasks in statistics and machine learning and the most classical comparison
methods assume that the distributions occur in spaces of the same dimension.
Recently, a new geometric solution has been proposed to address this problem
when the measures live in Euclidean spaces of differing dimensions. Here, we
study the same problem of comparing probability distributions of different
dimensions in the tropical geometric setting, which is becoming increasingly
relevant in computations and applications involving complex, geometric data
structures. Specifically, we construct a Wasserstein distance between measures
on different tropical projective tori - the focal metric spaces in both theory
and applications of tropical geometry - via tropical mappings between
probability measures. We prove equivalence of the directionality of the maps,
whether starting from the lower dimensional space and mapping to the higher
dimensional space or vice versa. As an important practical implication, our
work provides a framework for comparing probability distributions on the spaces
of phylogenetic trees with different leaf sets.
Comment: 15 pages
Topological Data Analysis of Database Representations for Information Retrieval
Appropriately representing elements in a database so that queries may be
accurately matched is a central task in information retrieval. This recently
has been achieved by embedding the graphical structure of the database into a
manifold so that the hierarchy is preserved. Persistent homology provides a
rigorous characterization for the database topology in terms of both its
hierarchy and connectivity structure. We compute persistent homology on a
variety of datasets and show that some commonly used embeddings fail to
preserve the connectivity. Moreover, we show that embeddings which successfully
retain the database topology coincide in persistent homology. We introduce the
dilation-invariant bottleneck distance to capture this effect, which addresses
metric distortion on manifolds. We use it to show that distances between
topology-preserving embeddings of databases are small.
Comment: 15 pages, 7 figures
Quantitative Analysis of Immune Infiltrates in Primary Melanoma.
Novel methods to analyze the tumor microenvironment (TME) are urgently needed to stratify melanoma patients for adjuvant immunotherapy. Tumor-infiltrating lymphocyte (TIL) analysis, by conventional pathologic methods, is predictive but is insufficiently precise for clinical application. Quantitative multiplex immunofluorescence (qmIF) allows for evaluation of the TME using multiparameter phenotyping, tissue segmentation, and quantitative spatial analysis (qSA). Given that CD3+CD8+ cytotoxic lymphocytes (CTLs) promote antitumor immunity, whereas CD68+ macrophages impair immunity, we hypothesized that quantification and spatial analysis of macrophages and CTLs would correlate with clinical outcome. We applied qmIF to 104 primary stage II to III melanoma tumors and found that CTLs were closer to activated (CD68+HLA-DR+) macrophages than to nonactivated (CD68+HLA-DR-) macrophages (P < 0.0001). CTLs were farther from proliferating SOX10+ melanoma cells than from nonproliferating ones (P < 0.0001). In 64 patients with known cause of death, we found that high CTL and low macrophage density in the stroma (P = 0.0038 and P = 0.0006, respectively) correlated with disease-specific survival (DSS), but the correlation was less significant for CTL and macrophage density in the tumor (P = 0.0147 and P = 0.0426, respectively). DSS correlation was strongest for stromal HLA-DR+ CTLs (P = 0.0005). CTL distance to HLA-DR- macrophages associated with poor DSS (P = 0.0016), whereas distance to Ki67- tumor cells associated inversely with DSS (P = 0.0006). A low CTL/macrophage ratio in the stroma conferred a hazard ratio (HR) of 3.719 for death from melanoma and correlated with shortened overall survival (OS) in the complete 104 patient cohort by Cox analysis (P = 0.009) and merits further development as a biomarker for clinical application.
Tropical Foundations for Probability & Statistics on Phylogenetic Tree Space
Preprint
We introduce a novel framework for the statistical analysis of phylogenetic trees: Palm tree space is constructed on principles of tropical algebraic geometry, and represents phylogenetic trees as a point in a space endowed with the tropical metric. We show that palm tree space possesses a variety of properties that allow for the definition of probability measures, and thus expectations, variances, and other fundamental statistical quantities. This provides a new, tropical basis for a statistical treatment of evolutionary biological processes represented by phylogenetic trees. In particular, we show that a geometric approach to phylogenetic tree space (first introduced by Billera, Holmes, and Vogtmann, which we reinterpret in this paper via tropical geometry) results in analytic, geometric, and topological characteristics that are desirable for probability, statistics, and increased computational efficiency.
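For readers unfamiliar with the tropical metric mentioned in this abstract, the following is a minimal sketch assuming its usual definition on the tropical projective torus: the distance between two points is the largest coordinate difference minus the smallest. The function name `tropical_distance` and the example vectors are illustrative assumptions; encoding actual trees (for instance, as vectors of pairwise leaf distances) is outside the scope of the sketch.

```python
def tropical_distance(x, y):
    """Tropical metric: max_i(x_i - y_i) - min_i(x_i - y_i).

    x and y are equal-length sequences of numbers. The value is unchanged
    when a constant is added to every coordinate of x or of y, which is
    why the metric is well defined on the tropical projective torus.
    """
    if len(x) != len(y):
        raise ValueError("points must have the same number of coordinates")
    diffs = [a - b for a, b in zip(x, y)]
    return max(diffs) - min(diffs)

# Hypothetical coordinate vectors, chosen only for illustration.
u = [0.0, 1.0, 3.0]
v = [0.0, 2.0, 1.0]
d = tropical_distance(u, v)

# Shifting every coordinate of u by the same constant leaves d unchanged.
d_shifted = tropical_distance([c + 5.0 for c in u], v)
```

The shift invariance checked at the end is the property that distinguishes this metric from the ordinary Euclidean distance on the same coordinates.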